Building Large Chinese Corpus for Spoken Dialogue Research in Specific Domains

Authors

  • Changliang Li
  • Xiuying Wang
Abstract

Corpora are a valuable resource for information retrieval and data-driven natural language processing systems, especially for spoken dialogue research in specific domains. However, there are few non-English corpora, particularly in Chinese. Spoken by the nation with the largest population in the world, Chinese is becoming increasingly prevalent and popular among millions of people worldwide. In this paper, we build a large-scale, high-quality Chinese corpus called CSDC (Chinese Spoken Dialogue Corpus). It covers five domains and contains more than 140 thousand dialogues in all. Unlike other corpora, each sentence in this corpus is additionally annotated with slot information. To the best of our knowledge, this is the largest Chinese spoken dialogue corpus, as well as the first one with slot information. Using this corpus, we propose a method and conduct a well-designed experiment. Finally, we report indicative results.


Similar resources

The Design of a Multi-domain Chinese Dialogue System

For a multi-domain spoken dialogue system, there are two major difficulties in building its dialogue management. One is to interpret users' interested domain correctly and switch between domains smoothly. The other is the high cost of merging the dialogue management of different application domains. To solve these problems, an implicit domain identification and switching mechanism for dialogue ...

Keynote: Statistical Approaches to Open-domain Spoken Dialogue Systems

In contrast to traditional rule-based approaches to building spoken dialogue systems, recent research has shown that it is possible to implement all of the required functionality using statistical models trained using a combination of supervised learning and reinforcement learning. This approach to spoken dialogue is based on the mathematics of partially observable Markov decision processes (PO...

Efficient language model development for spoken dialogue recognition and its evaluation on operator's speech at call centers

While a language model for recognition of spoken dialogue is ideally built from a very large, specific-task-oriented corpus, a great amount of time and effort is required to develop such a corpus, and this involves both the audio recording and written transcription of large amounts of speech data. Training data for a language model should match the target task in both topic and style. What is n...

Collecting Voices from the Cloud

The collection and transcription of speech data is typically an expensive and time-consuming task. Voice over IP and cloud computing are poised to greatly reduce this impediment to research on spoken language interfaces in many domains. This paper documents our efforts to deploy speech-enabled web interfaces to large audiences over the Internet via Amazon Mechanical Turk, an online marketplace ...

Behavior Specific User Simulation in Spoken Dialogue Systems

Spoken dialogue systems provide an opportunity for man machine interaction using spoken language as the medium of interaction. In recent years reinforcement learning-based dialogue policy optimization has evolved to be state of the art. In order to cope with the data requirement for policy optimization and also to evaluate dialogue policies user simulators are introduced. Almost all existing da...

Publication date: 2017